Abstract: The advantages of VISCORS, a visual contents recommender system for the mobile web, are discussed. 
VISCORS combines two popular information filtering techniques, collaborative filtering and content-based 
image retrieval. With the help of VISCORS, customers can purchase content with much less search effort and 
much lower connection time. It also helps mobile web content providers improve the profitability of their business, 
because lower customer frustration in finding desired content increases revenue through an improved purchase 
conversion rate. (Edited abstract) 10 Refs. 

Descriptors: *World Wide Web; Mobile computing; Wireless telecommunication systems; Personal digital 
assistants; Intelligent agents; Image retrieval; Information analysis; Web browsers; Real time systems; Database 
systems; Multimedia systems; Feedback 
Publication Year: 2004 

CODEN: ATISET ISSN: 1046-8188 

Language: English 

Document Type: JA; (Journal Article) Treatment: L; (Literature Review/Bibliography); T; (Theoretical) 
Journal Announcement: 0408W4 

Abstract: Recommender systems using collaborative filtering are a popular technique for reducing information 
overload and finding products to purchase. One limitation of current recommenders is that they are not portable. 
They can only run on large computers connected to the Internet. A second limitation is that they require the user to 
trust the owner of the recommender with personal preference data. Personal recommenders hold the promise of 
delivering high quality recommendations on palmtop computers, even when disconnected from the Internet. 
Further, they can protect the user's privacy by storing personal information locally, or by sharing it in encrypted 
form. In this article we present the new PocketLens collaborative filtering algorithm along with five peer-to-peer 
architectures for finding neighbors. We evaluate the architectures and algorithms in a series of offline experiments. 
These experiments show that PocketLens can run on connected servers, on usually connected workstations, or on 
occasionally connected portable devices, and produce recommendations that are as good as the best published 
algorithms to date. 66 Refs. 
Abstract: Recommender systems have changed the way people shop online. Recommender systems on wireless 
mobile devices may have the same impact on the way people shop in stores. We present our experience with 
implementing a recommender system on a PDA that is occasionally connected to the network. This interface helps 
users of the MovieLens movie recommendation service select movies to rent, buy, or see while away from their 
computer. The results of a nine-month field study show that although there are several challenges to overcome, 
mobile recommender systems have the potential to provide value to their users today. 13 Refs. 
Recommender systems using collaborative filtering are a popular technique for reducing information overload 
and finding products to purchase. However, the economic model required to run a business around collaborative 
filtering is at odds with the end user's desire for unvarnished recommendations. Traditional recommender systems 
are centralized and available only online, which is at odds with the user's desire to have recommendations wherever 
they are. A personal recommender system will someday empower people with the technology needed to assert their 
freedom to share information of all kinds, and to take recommendations with them, wherever they go. 

In this thesis we take three steps toward the long-term vision of a personal recommender system: the PocketLens 
peer-to-peer collaborative filtering algorithm, the MultiLens recommendation framework, and MovieLens 
Unplugged. 

We present the PocketLens collaborative filtering algorithm along with four peer-to-peer architectures for 
finding neighbors. We evaluate the architectures and algorithms in a series of experiments. These experiments show 
that PocketLens can run on portable and disconnected devices, give users control of their data, and produce 
recommendations that are as good as the best published algorithms to date. 

We present the MultiLens framework, a new recommendation engine capable of combining multiple dimensions 
of preference and content information into a model used to make recommendations. We identify twelve application 
patterns used by recommender applications, and show how the MultiLens framework can be used to implement 
these patterns. We experimentally evaluate the ability of MultiLens to combine a content dimension with a quality 
dimension to solve the first rater and sparsity problems in collaborative filtering. 

We present MovieLens Unplugged, which examines several important challenges that interface designers must 
overcome on mobile devices: providing sufficient value to attract prospective wireless users, handling occasionally 
connected devices, privacy and security, and surmounting the physical limitations of the devices. We present our 
experience with the implementation of a wireless movie recommender system on a cell phone browser, an 
AvantGo channel, a wireless PDA, and a voice-only phone interface. These interfaces help MovieLens users select 
movies to rent, buy, or see while away from their computer. 
Conference Title: Database and Expert Systems Applications. 17th International Conference, DEXA 2006. 
Proceedings 

Conference Date: 4-8 Sept. 2006 Conference Location: Krakow, Poland 

Language: English Document Type: Conference Paper (PA) 
Treatment: Practical (P); Theoretical (T) 

Abstract: Utilizing Global Positioning System (GPS) technology, it is possible to find and recommend restaurants 
for users operating mobile devices. For recommending restaurants, personal digital assistants or cellular phones 
only consider the location of restaurants. However, a user's background and environment information is assumed to 
be directly related to recommendation quality. In this paper, therefore, a recommender system using context 
information and a decision tree model for efficient recommendation is presented. This system considers location 
context, personal context, environment context, and user preference. Restaurant lists are obtained from location 
context, personal context, and environment context using the decision tree model. In addition, a weight value is used 
for reflecting user preferences. Finally, the system recommends appropriate restaurants to the mobile user. For this 
experiment, performance was verified using measurements such as k-fold cross-validation and mean absolute error. 
As a result, the proposed system obtained an improvement in recommendation performance. ( 15 Refs) 
Conference Title: PRICAI 2004: Trends in Artificial Intelligence. 8th Pacific Rim International Conference on 
Artificial Intelligence. Proceedings 

Conference Date: 9-13 Aug. 2004 Conference Location: Auckland, New Zealand 

Language: English Document Type: Conference Paper (PA) 
Treatment: Practical (P) 

Abstract: As mobile Internet technology becomes increasingly applicable, the mobile contents market, 
especially character image downloading for mobile phones, has recorded remarkable growth. In spite of this rapid 
growth, however, most customers experience inconvenience, lengthy search processes and frustration in 
searching for the specific character images they want, due to inefficient sequential search. This article describes a 
personalized image recommender system designed to reduce customers' search efforts in finding desired character 
images on the mobile Internet. The system combines two of the most popular information filtering techniques: 
collaborative filtering and content-based image retrieval. ( 2 Refs) 
Subfile: C 
Abstract: Current search methods for mobile-Web content can be frustrating to use. To shorten searches for cell 
phone wallpaper images, VISCORS combines collaborative filtering with content-based image retrieval. An 
increasing selection of content is becoming available in the mobile-Web environment, where users navigate the Web 
using wireless devices such as cell phones and PDAs. The fast growth and excellent prospects of the mobile-Web 
content market have attracted many content providers. ( 10 Refs) 
Subfile: C 
Abstract: Describes a personalized recommender system that has been designed to suggest new products to 
supermarket shoppers. The recommender functions in a pervasive computing environment, namely a remote 
shopping system in which supermarket customers use personal digital assistants (PDAs) to compose and transmit 
their orders to the store, which assembles them for subsequent pickup. The recommender is meant to provide an 
alternative source of new ideas for customers who now visit the store less frequently. Recommendations are 
generated by matching products to customers based on the expected appeal of the product and the previous spending 
of the customer. Association mining in the product domain is used to determine relationships among product classes 
for use in characterizing the appeal of individual products. Clustering in the customer domain is used to identify 
groups of shoppers with similar spending histories. Cluster-specific lists of popular products are then used as input 
to the matching process. The recommender is currently being used in a pilot program with several hundred 
customers. Analysis of the results to date has shown a 1.8% boost in program revenue as a result of purchases 
made directly from the list of recommended products. A substantial fraction of the accepted recommendations are 
from product classes new to the customer, indicating a degree of willingness to expand beyond present purchase 
patterns in response to reasonable suggestions. ( 21 Refs) 
Abstract: As an example of distributed personalization in a pervasive computing environment, this paper describes 
a PDA (personal digital assistant) based personal "wine guru" agent that works hand-in-hand with an on-board 
supermarket shopping program, and also with a server-based data-mining program that provides personalised wine 
lists. The guru ascertains which list of the store's wines from the server best matches the user's tastes in wine, keeps 
an eye on the user's on-board wine-cellar list, and can read the shopping list that the user is preparing and then 
suggest wines to go with some of the items after asking how they will be prepared. The user may elect to add some 
of the suggested wines to the shopping list, or choose from the wine cellar. For portability, a subset of Java with a 
virtual machine small enough to fit on to a PalmPilot was used. It provided all the needed functionality and an 
acceptable response time. ( 12 Refs) 
Abstract: Current search methods for mobile Web content can be frustrating to use. To shorten searches for 
cellphone wallpaper images, VISCORS combines collaborative filtering with content-based image retrieval. 
Cellular phone ringing tone recommendation system based on collaborative filtering method 
Abstract: 

We have developed a prototype of a cellular phone ringing tone recommendation system using memory-based 
collaborative filtering and have carried out experiments to evaluate its performance. The ringing tone content 
was stored on a server from which users could download the desired items according to their preferences. 
Extensive log data accumulated at the download service site over a fixed period of time was used. The log data 
contained only information on the users' downloaded ringing tones, without evaluation data. The user set and the 
downloadable tone content set were not fixed, and our goal was to investigate how collaborative filtering could be 
successfully applied to a system with such continuously changing conditions. The Jaccard similarity coefficient 
was used to calculate the similarity between the users. The learning period, the recommendation period and the 
number of the similar users were used as condition parameters. The system quality evaluation showed that the recall 
increases with the increase of the learning period but decreases with the increase of the recommendation period. 
Optimal values for the number of the most similar users as well as for the learning and the recommendation periods 
were experimentally obtained. It was shown that the collaborative filtering method could be successfully applied to 
a cellular phone ringing tone recommendation system. 
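
As a rough illustration of the similarity measure named in this abstract, the following Python sketch computes the Jaccard coefficient between users from download logs that carry no ratings, and ranks unseen items by the similarity of the users who downloaded them. The log layout, user names, tone IDs and the scoring rule are assumptions made for the example; the abstract does not specify them.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty histories are not similar
    return len(a & b) / len(union)


def recommend(downloads, user, k=2):
    """Rank items that the k most similar users downloaded but `user` has not."""
    own = downloads[user]
    neighbours = sorted(
        ((jaccard(own, items), other)
         for other, items in downloads.items() if other != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbours:
        for item in downloads[other] - own:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest accumulated similarity first; ties broken by item name.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical download log: user -> set of ringing tone IDs.
downloads = {
    "u1": {"t1", "t2", "t3"},
    "u2": {"t2", "t3", "t4"},
    "u3": {"t1", "t5"},
    "u4": {"t6"},
}
print(recommend(downloads, "u1"))
```

Here `k` plays the role of the "number of similar users" parameter mentioned in the abstract; the learning and recommendation periods would correspond to how the log is windowed before building `downloads`.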
Abstract: Collaborative filtering, often used in e-commerce applications, is a method to cluster similar users 
based on their profiles, characteristics or attitudes on specific subjects. This paper proposes a novel method to 
implement dynamic collaborative filtering by genetics-based machine learning, in which we employ Learning 
Classifier Systems extended to multiple environments. The proposed method is used in yet another mobile agent 
system: a distributed smart IC card system. The characteristics of the proposed method are summarized as follows: 
(1) It is effective in distributed computer environments with PCs, even for a small number of users. (2) It learns 
users' profiles from their individual behaviors and then generates recommendations and advice for each user. 
(3) The results are automatically accumulated in a local system on a PC and are then distributed via smart IC cards 
while the users are interacting with the system. The method has been implemented and validated in the Group Trip 
Advisor prototype: a PC-based distributed recommender system for travel information. ( 12 Refs) 
The explosive growth of information available on the Internet has created a clear need for novel methods that help 
users locate relevant information quickly and with minimal effort. The central argument of this dissertation is that 
learning about users' multiple and potentially changing interests calls for algorithms specifically designed for this 
purpose. Guided by this principle, I introduce the Adaptive Information Server (AIS), 
a client-server framework for domain-independent adaptive information access. I describe two 
applications that use AIS to learn about users' interests in daily news stories: one system operates on 
the World Wide Web, the other is geared towards wireless information access. The description and evaluation of the 
underlying recommendation algorithms form the core of the dissertation. 

First, I describe a content-based learning algorithm designed to learn about users' multiple and frequently 
changing interests. The key to the algorithm's performance lies in its multi-strategy design: it learns separate models 
of users' short-term and long-term interests. An empirical evaluation shows that the combination of both models 
performs better than each individual model alone. In addition, the algorithm maintains a model of information the 
user is likely to know, so that the presentation of redundant content can be avoided. Second, I show how the 
described content-based algorithm can be extended with a collaborative filtering component. In particular, I cast 
collaborative filtering as a learning task, and present a novel algorithm that uses the Singular Value Decomposition 
to derive a low-dimensional data representation that forms the basis of an efficient and accurate approach to 
collaborative filtering. Empirical results demonstrate that the resulting approach outperforms previously proposed 
algorithms, and that combining content-based and collaborative techniques leads to overall performance 
improvements. 

The dissertation concludes with a description of two empirical studies that evaluate the utility of adaptive 
information access from a user perspective. These studies show that the adaptive presentation of personalized 
content simplifies access to relevant information, and that the observed performance can be achieved without 
requiring any extra work from the user. 
Abstract 

The collaborative filtering (CF) technique has proved to be one of the most successful techniques in 
recommender systems in recent years. However, most existing CF-based recommender systems work in a 
centralized way and scale poorly, because their computational complexity grows quickly in both time and 
space as the number of records in the user database increases. In this article, we first propose a distributed CF 
algorithm called PipeCF together with two novel approaches, significance refinement and unanimous 
amplification, to further improve scalability and prediction accuracy. We then show how to implement this 
algorithm on a Peer-to-Peer (P2P) structure through the distributed hash table method, which is the most popular 
and efficient P2P routing algorithm, to construct a scalable distributed recommender system. The experimental 
data show that the distributed CF-based recommender system scales much better than traditional 
centralized ones, with comparable prediction efficiency and accuracy. 
1. Introduction 

A recommender system helps users find the items they want by making recommendations 
based on either the content of the recommended items (content-based filtering) or the ratings of similar users on 
the recommended items (collaborative filtering, CF). Since Goldberg, Nichols, Oki, and Terry (1992) published 
the first account of using CF for information filtering, CF has proved to be one of the most successful techniques 
in recommendation systems, thanks to the advantage that no explicit description of items is needed. The key idea of 
CF is that users will prefer those items that people with similar interests prefer, or even that dissimilar people do 
not prefer. Most CF algorithms can therefore be separated into three steps, as addressed by Herlocker, Konstan, 
Borchers, and Riedl (1999): (1) Similarity weighting: weight all users with respect to their similarity with the active 
user, that is, the user whose preferences are to be predicted; (2) Neighborhood selection: select the users 
used to make the prediction; (3) Rating normalization and prediction: normalize and calculate the weighted 
sum of the selected users' ratings, then make the prediction based on that. According to the techniques used in the 
first step, CF algorithms can be divided into two classes: memory-based algorithms and model-based 
algorithms. Breese, Heckerman, and Kadie (1998) performed an empirical analysis of both kinds of CF algorithms, 
while Herlocker et al. (1999) presented an algorithmic framework for performing CF. 

GroupLens (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994) was the first CF algorithm to automate 
prediction and used a memory-based algorithm. Like most memory-based algorithms, GroupLens needs to 
compute across the whole user database to calculate the similarities between the active user and other users in order 
to make a prediction. Ringo (Shardanand & Maes, 1995) used only those neighbors whose correlation was greater 
than a given threshold to make a prediction. This approach not only reduced the computational complexity but also 
proved to improve performance. The same improvement can be obtained by choosing the top-N users with the 
highest correlations. However, the similarities of all other users still have to be calculated, and the 
complexity increases quickly in both time and space as the number of records in the database increases. 

Basically, there are two ways to reduce this computational complexity. The first is to use a model-based 
algorithm, which first constructs a mathematical model, such as a Bayesian network or a Bayesian 
classifier, to describe the users and/or their ratings, then learns the model from the database and uses 
it to make predictions. However, these approaches also need complex calculations when compiling the models and 
also require a central database holding all the user data, which is sometimes hard to achieve, not only for 
technical reasons but also for privacy reasons. 

The second way is to implement CF in a decentralized way. In fact, as Peer-to-Peer (P2P) gains more and more 
popularity, some researchers have already begun to consider it as an alternative architecture to reduce the 
computational complexity of centralized CF algorithms (Tveit, 2001; Olsson, 2003; Canny, 2002). The main 
difference between centralized and distributed CF-based recommender systems is that the originally 
centralized user database is maintained in a decentralized way: each peer keeps only a fraction 
of the user database, and when making a prediction for a particular user, the needed records are first retrieved from 
other peers to the user's own database and the calculation is done locally. To do this, the following two problems have 
to be addressed: (1) how to store the user database in a distributed way so that needed information can be found 
efficiently; (2) how to identify the records needed to make a prediction for a particular user and fetch them 
efficiently, as retrieving all other users' votes is not only unreasonable but also unnecessary. 

The main contributions of this article are: 

(1) We propose PipeCF: a distributed CF algorithm which can be implemented on a P2P overlay network; 

(2) We propose two novel approaches: significance refinement (SR) and unanimous amplification (UA), to 
improve the performance of our distributed CF algorithm; 

(3) We give a framework for implementing our distributed CF algorithm on a P2P overlay network through 
a distributed hash table (DHT) based technique, to obtain efficient user database management and retrieval and to 
construct a decentralized CF recommender system. 

The rest of this article is organized as follows. In Section 2, several related works are presented and discussed. 
In Section 3, we introduce the architecture and key features of our DHT-based CF system. Two techniques, SR 
and UA, are also proposed in that section to improve the scalability and prediction accuracy of the DHT-based CF 
algorithm. In Section 4, the experimental results of our system are presented and analyzed. Finally, 
we make brief concluding remarks and outline future work in Section 5. 

2. Related work 

2.1. Memory-based CF algorithm 

Generally, the task of CF is to predict the votes of active users from the user database, which consists of a set of 
votes $v_{i,j}$ corresponding to the vote of user $i$ on item $j$. A memory-based CF algorithm calculates this prediction 
as a weighted average of other users' votes on that item through the following formula: 

$$P_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i) \qquad (1)$$

where $P_{a,j}$ denotes the prediction of the vote for active user $a$ on item $j$ and $n$ is the number of users in the user 
database. $\bar{v}_i$ is the mean vote for user $i$: 

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j} \qquad (2)$$

where $I_i$ is the set of items on which user $i$ has voted. The weights $w(a,i)$ reflect the similarity between the active 
user and the users in the user database, and $\kappa$ is a normalizing factor that makes the absolute values of the weights 
sum to unity. 

Most memory-based algorithms use Eq. (1) to make predictions and differ only in the way they 
calculate the weights: 

2.1.1. Pearson correlation coefficient 

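The Pearson correlation coefficient was first introduced into collaborative filtering as a weighting method in the 
GroupLens project. The correlation between users $a$ and $i$ is: 

$$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}} \qquad (3)$$

where the summations are calculated over those items for which both users $a$ and $i$ have voted. 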
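
To make the memory-based prediction of Eq. (1) concrete, here is a minimal Python sketch that uses Pearson correlation as the weight. It is an illustration under stated assumptions, not the authors' implementation: the user database is a plain dictionary, the mean vote is taken over all items a user has rated, the normalizing factor is one over the sum of absolute weights, and the toy votes are invented.

```python
from math import sqrt


def mean_vote(votes):
    """Mean vote of one user over the items that user has rated."""
    return sum(votes.values()) / len(votes)


def pearson(va, vi):
    """Pearson correlation, summed over the items both users have rated."""
    common = set(va) & set(vi)
    ma, mi = mean_vote(va), mean_vote(vi)
    num = sum((va[j] - ma) * (vi[j] - mi) for j in common)
    den = sqrt(sum((va[j] - ma) ** 2 for j in common)
               * sum((vi[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0


def predict(db, active, item):
    """Active user's mean vote plus the normalized weighted deviations of others."""
    va = db[active]
    pairs = [(pearson(va, vi), vi)
             for user, vi in db.items() if user != active and item in vi]
    norm = sum(abs(w) for w, _ in pairs)
    if norm == 0:
        return mean_vote(va)  # no usable neighbour: fall back to the mean
    deviation = sum(w * (vi[item] - mean_vote(vi)) for w, vi in pairs)
    return mean_vote(va) + deviation / norm


# Hypothetical votes on a 1-5 scale: user -> {item: vote}.
db = {
    "a": {"x": 5, "y": 3, "z": 4},
    "b": {"x": 4, "y": 2, "z": 3, "q": 5},
    "c": {"x": 1, "y": 5, "q": 2},
}
print(round(predict(db, "a", "q"), 3))
```

User `b` agrees with `a` and liked `q`, while user `c` disagrees with `a` and disliked `q`, so both push the prediction above the mean vote of `a`. Note that the loop touches every user who rated the item, which is the cost the rest of the article sets out to reduce.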

2.1.2. Vector similarity 

The vector similarity was first used to measure the similarity between two documents. Each document was 
viewed as a vector of word frequency and their similarity was computed as the cosine of the angle between 
these two vectors. In Collaborative Filtering, we treat each user record as a document and their votes as 
frequency of items. So the weights can now be calculated as: 



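$$w(a,i) = \sum_j \frac{v_{a,j}}{\sqrt{\sum_{k} v_{a,k}^2}} \cdot \frac{v_{i,j}}{\sqrt{\sum_{k} v_{i,k}^2}} \qquad (4)$$

where the sum over $j$ runs over the items both users have voted on, and the squared terms in each denominator are 
summed over all items voted on by that user. 

2.2. P2P system and DHT routing algorithm 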



The term 'Peer-to-Peer' refers to a class of systems and applications that employ distributed resources to 
perform a critical function in a decentralized manner. With the pervasive deployment of computers, P2P is 
increasingly receiving attention in research and more and more P2P systems have been deployed on the 
Internet. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on 
centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; 
and enabling resource aggregation. Among all these applications, three main classes of peer-to-peer 
applications have emerged: parallelizable, content and file management, and collaborative. 

As the main purpose of P2P systems is to share resources among a group of computers called peers in a 
distributed way, efficient and robust routing algorithms for locating wanted resources are critical to the performance 
of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is one of the most popular and 
effective, and is supported by many P2P systems such as CAN (Ratnasamy, Francis, Handley, Karp, & Shenker, 
2001), Chord (Stoica et al., 2001), Pastry (Rowstron & Druschel, 2001), and Tapestry (Zhao et al., 2001). 

A DHT overlay network is composed of several DHT nodes, and each node keeps a set of resources (e.g. files, 
ratings of items). Each resource is associated with a key (produced, for instance, by hashing the file name) and 
each node in the system is responsible for storing a certain range of keys. Peers in the DHT overlay network 
locate a wanted resource by issuing a lookup(key) request, which returns the identity (e.g. the IP address) of the 
node that stores the resource with that key. The primary goals of DHT are to provide an efficient, scalable, 
and robust routing algorithm that reduces the number of P2P hops involved in locating 
a certain resource, and to reduce the amount of routing state that must be kept at each peer. In Chord 
(Stoica et al., 2001), each peer keeps track of log N other peers (N is the total number of peers in the 
community). When a peer joins or leaves the overlay network, this highly optimized version of the DHT algorithm 
only requires notifying log N peers about that change. 
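
The key-to-node mapping described above can be sketched in a few lines of Python. This is a toy model with a global view of the ring, written only to show how a key is assigned to the node responsible for its range; it does not implement Chord's log N routing state or hop-by-hop lookup, and the class name, the 16-bit identifier space and the peer names are assumptions made for the example.

```python
import hashlib
from bisect import bisect_left


def key_id(text, bits=16):
    """Map a string to an identifier on a ring of size 2**bits using SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyRing:
    """Each node is responsible for the keys up to and including its own id."""

    def __init__(self, node_names):
        self.nodes = sorted((key_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.store = {name: {} for name in node_names}

    def lookup(self, key):
        """Return the node responsible for `key`: the first node id >= key id."""
        pos = bisect_left(self.ids, key_id(key)) % len(self.nodes)  # wrap around
        return self.nodes[pos][1]

    def put(self, key, value):
        """Store `value` under `key` on the responsible node."""
        self.store[self.lookup(key)].setdefault(key, []).append(value)

    def get(self, key):
        """Read back everything stored under `key`."""
        return self.store[self.lookup(key)].get(key, [])


ring = ToyRing(["peer-a", "peer-b", "peer-c"])
ring.put("song.mp3", "held by peer-b")
print(ring.lookup("song.mp3"), ring.get("song.mp3"))
```

In a real DHT no peer holds the full sorted list of node identifiers; each peer forwards the request to a peer closer to the key, which is where the logarithmic hop count comes from.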

3. Our distributed CF algorithm 

3.1. Basic PipeCF algorithm 

The first step in implementing the CF algorithm in a distributed way is to divide the original centralized user database 
into fractions that can be stored on distributed peers. For concision, we use the term bucket to denote one 
fraction of the whole user database in the rest of this article. Each bucket must also be assigned an 
identifier through which it can be located later when needed. We do the division so that each 
bucket holds the records of a group of users who have rated at least one item with the same vote. In this way we construct 
one bucket for every distinct <ITEM_ID, VOTE> tuple and use that tuple as the identifier for the bucket. Fig. 1 
shows our division strategy: 



Fig. 1. User database division strategy. 

In order to reduce the computational complexity and achieve scalability, we wish to find a strategy that chooses 
neighbors from the most suitable buckets when making a prediction. The PipeCF algorithm chooses neighbors 
based on the heuristic that people with similar interests rate at least one item with similar votes. As we can see in 
Fig. 2, this strategy has a very high hit ratio. So when making a prediction, PipeCF uses only the records of those users 
who share at least one bucket with the active user. In this way we reduce the calculation by about 50% 
compared with the traditional CF algorithm and obtain comparable predictions, as shown in Fig. 8. 
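
The division strategy and the neighbor heuristic can be sketched as follows. This is a minimal single-machine illustration, assuming the user database is a dictionary of votes; the function names and the toy data are hypothetical and the buckets are kept in one in-memory dictionary rather than spread over peers.

```python
from collections import defaultdict


def build_buckets(db):
    """One bucket per (item_id, vote) tuple, holding the users who cast that vote."""
    buckets = defaultdict(set)
    for user, votes in db.items():
        for item, vote in votes.items():
            buckets[(item, vote)].add(user)
    return dict(buckets)


def candidate_neighbours(db, buckets, active):
    """Users who share at least one bucket with the active user."""
    found = set()
    for item, vote in db[active].items():
        found |= buckets.get((item, vote), set())
    found.discard(active)
    return found


# Hypothetical votes: user -> {item_id: vote}.
db = {
    "alice": {"m1": 5, "m2": 3},
    "bob": {"m1": 5, "m3": 4},
    "carol": {"m2": 1, "m3": 4},
    "dave": {"m2": 3},
}
buckets = build_buckets(db)
print(sorted(candidate_neighbours(db, buckets, "alice")))
```

Carol has rated an item in common with Alice but never with the same vote, so she shares no bucket with Alice and is left out of the similarity computation. Only the remaining candidates would then be weighted and used in the prediction.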
Fig. 2. Architecture of DHT-based CF recommender system. 
3.2. Improved PipeCF algorithm 

3.2.1. Significance refinement 

In the basic PipeCF algorithm, we use all users who are in the same bucket as the active user and find that 
the number of fetched users is O(N) (N is the total number of users), as Fig. 8 shows. In fact, as Breese 
et al. (1998) showed with the notion of inverse user frequency, universally liked items are not as useful as less 
common items in capturing similarity. So we introduce a new concept, SR, which reduces the number of users returned 
by the basic PipeCF algorithm by limiting the number of users returned for each bucket. We term the 
algorithm improved by SR Return K, which means 'for every item, the PipeCF algorithm returns no more than 
K users for each bucket'. The experimental result in Section 5.3.3 shows that this method reduces the number of 
returned users dramatically and also improves the prediction accuracy. 
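
A minimal sketch of the Return K idea follows. The text does not say how the K users of an oversized bucket are chosen, so uniform random sampling with a fixed seed is an assumption made here, as are the function names and the toy buckets.

```python
import random


def return_k(bucket_users, k, rng):
    """Return at most k users from one bucket; small buckets are returned whole."""
    users = sorted(bucket_users)
    if len(users) <= k:
        return users
    return rng.sample(users, k)  # assumed selection rule: uniform sampling


def refined_candidates(active, active_votes, buckets, k, seed=0):
    """Union of at most k users from every bucket the active user belongs to."""
    rng = random.Random(seed)
    found = set()
    for item, vote in active_votes.items():
        found.update(return_k(buckets.get((item, vote), ()), k, rng))
    found.discard(active)
    return found


# Hypothetical buckets: a universally liked item and a rarely rated one.
buckets = {
    ("m1", 5): {f"u{n}" for n in range(100)},
    ("m2", 2): {"u3", "u200"},
}
active_votes = {"m1": 5, "m2": 2}
print(len(refined_candidates("u3", active_votes, buckets, k=5)))
```

The popular bucket contributes at most five users instead of one hundred, while the rare bucket is returned in full, so less common votes keep their relative influence on the neighborhood.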

3.2.2. Unanimous amplification 

Inspired by the method of case amplification (Breese et al., 1998), which emphasizes the contribution of the 
most similar users to the prediction by amplifying the weights close to 1, we argue that we should give a special 
reward to the users who rated some items with the same vote by amplifying their weights, which we term UA. We 
transform the estimated weights as follows: 

[...] our experiments is 2.0, β is 4.0, and γ is 4. The experimental result in Section 4.3.4 shows that the UA approach 
improves the prediction accuracy of the PipeCF algorithm. 

4. DHT-based CF recommender system 

4.1. Architecture of DHT-based CF recommender system 

The main advantage of our DHT-based CF recommender system over a traditional centralized CF recommender 
system is that both the maintenance of the user database and the complex computation task of making predictions 
are done in a decentralized way, so as to obtain better scalability. Here, we treat each bucket as a resource, and the 
unique key for the resource is generated from the particular <ITEM_ID, VOTE> tuple associated with the bucket. 
Each peer in the DHT overlay network keeps one or several buckets locally. Fig. 2 gives the architecture of 
our DHT-based CF recommender system. 

So the implementation of PipeCF on a DHT overlay network is straightforward, except that the buckets are stored 
in a distributed way, so that when a user wants to look up other similar users who have the same particular <ITEM_ID, 
VOTE> tuple, it needs to fetch them from the DHT overlay network. We still need a special scheme to manage the 
distributed storage of buckets. DHT provides two functions, lookup(key) and put(key), which we describe 
later, to do these jobs, and their efficiency is guaranteed by the algorithm itself. So with the DHT overlay 
network, all the users in the CF system are connected together and can find their wanted similar neighbors 
efficiently through a DHT routing algorithm. 

4.2. Implementation of PipeCF on DHT 

On the basis of the decentralized storage of user votes, we introduce our implementation of the PipeCF algorithm 
on the DHT overlay network, called the DHT-based CF algorithm, as shown in Fig. 3. 





Fig. 3. DHT-based CF algorithm. 

There are two key pieces to the DHT-based CF algorithm: the lookup mechanism used to locate similar users, and 
the retrieval of their actual ratings. Decentralized storage (and hence decentralized retrieval) makes the CF 
calculation inherently scalable, because every user computes recommendations locally instead of depending on a 
centralized server; the hard part is finding the similar peers from which to retrieve the actual ratings. 

We devise a scalable solution to the problem of locating similar users in a decentralized CF system: given a 
user's vote vector, we can find the IP address of the node(s) holding users similar to that user. Our DHT-based 
solution aims at the following goals: 

• Scalability: it must be designed to scale to several million nodes. 

• Efficiency: similar users should be located reasonably quickly and with low overhead in terms of the message 
traffic generated. 

• Dynamicity: the system should be robust to frequent node arrivals and departures in order to cope with the 
highly transient user populations characteristic of decentralized environments. 

• Balanced load: in keeping with the decentralized nature, the total resource load (traffic, storage, etc.) should 
be roughly balanced across all the nodes in the system. 

We therefore select similar users only from the subset of users that share an <ITEM_ID, VOTE> tuple with the 
active user. The key idea of our algorithm is to hash every user once for every item that user has rated. Our 
DHT-based CF algorithm includes two main DHT functions, put(key) and lookup(key), shown in Fig. 4 and Fig. 5, 
respectively. 






Fig. 4. DHT-based CF put function. 

Fig. 5. DHT-based CF lookup function. 

The DHT-based CF Put algorithm is used to construct the DHT overlay network and fill it with data. The DHT-based 
CF Lookup algorithm is used to look up and fetch similar users with the same <ITEM_ID, VOTE> tuple in order to 
construct a local training set for making recommendations. The main purpose of steps 2 and 3 in Fig. 3 is to make 
every peer in the DHT overlay network keep several buckets, each containing a group of users with the same 
<ITEM_ID, VOTE> tuple, from which the Lookup algorithm can later fetch similar users in its steps 2 and 3. 
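
Since Figs. 4 and 5 did not survive in this copy, the following is only a minimal sketch of how put and lookup could cooperate, not the authors' procedure. It assumes a single in-memory dictionary standing in for the overlay, a SHA-1 key derived from the <ITEM_ID, VOTE> tuple, and a put that carries the user record as a second argument; the names InMemoryDHT, bucket_key, publish and fetch_training_set are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

UserRecord = Tuple[str, Dict[str, int]]  # (user id, that user's votes)


def bucket_key(item_id: str, vote: int) -> str:
    """Derive a DHT key from an <ITEM_ID, VOTE> tuple (SHA-1 is an assumption)."""
    return hashlib.sha1(f"{item_id}:{vote}".encode("utf-8")).hexdigest()


class InMemoryDHT:
    """Stand-in for the overlay network: one dictionary instead of many peers."""

    def __init__(self) -> None:
        self._store: Dict[str, List[UserRecord]] = defaultdict(list)

    def put(self, key: str, record: UserRecord) -> None:
        # Append the record to the bucket identified by the key.
        self._store[key].append(record)

    def lookup(self, key: str) -> List[UserRecord]:
        # Return a copy of the bucket, or an empty list for an unknown key.
        return list(self._store.get(key, []))


def publish(dht: InMemoryDHT, user: str, votes: Dict[str, int]) -> None:
    """Hash the user once for every rated item."""
    for item_id, vote in votes.items():
        dht.put(bucket_key(item_id, vote), (user, votes))


def fetch_training_set(dht: InMemoryDHT, user: str,
                       votes: Dict[str, int]) -> Dict[str, Dict[str, int]]:
    """Collect the records of users sharing at least one bucket with `user`."""
    training: Dict[str, Dict[str, int]] = {}
    for item_id, vote in votes.items():
        for other, other_votes in dht.lookup(bucket_key(item_id, vote)):
            if other != user:
                training[other] = other_votes
    return training


dht = InMemoryDHT()
publish(dht, "u1", {"m1": 5, "m2": 3})
publish(dht, "u2", {"m1": 5, "m2": 1})
publish(dht, "u3", {"m1": 2})
local_training_set = fetch_training_set(dht, "u1", {"m1": 5, "m2": 3})
```

In this toy run only u2 shares a bucket with u1, so the local training set holds u2's record alone. In a real overlay each key would be routed to the peer responsible for it rather than resolved in one process.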

5. Experimental evaluation 

In this section, we describe the data set, metrics and methodology used to compare the traditional and 
DHT-based CF algorithms, and present the results of our experiments. 

5.1. Data set 

We use the EachMovie collaborative filtering data set (1997) to evaluate the performance of the improved 
algorithm. The EachMovie data set is provided by the Compaq System Research Center, which ran the EachMovie 
recommendation service for 18 months to experiment with a collaborative filtering algorithm. The information 
gathered during that period consists of 72,916 users, 1628 movies, and 2,811,983 numeric ratings ranging from 
0 to 5. To speed up our experiments, we only use a subset of the EachMovie data set. 



5.2. Metrics and methodology 

The metrics for evaluating the accuracy of a prediction algorithm can be divided into two main categories: 
statistical accuracy metrics and decision-support metrics. Statistical accuracy metrics evaluate the accuracy of a 
predictor by comparing predicted values with user-provided values. Decision-support accuracy measures how 
well predictions help users select high-quality items. We report our prediction experiments using Mean Absolute 
Error (MAE), a statistical accuracy metric, because it is the most commonly used and is easy to understand: 

$$\mathrm{MAE} = \frac{\sum_{(a,j) \in T} \left| v_{a,j} - P_{a,j} \right|}{|T|} \qquad (6)$$

where $v_{a,j}$ is the rating given to item $j$ by user $a$, $P_{a,j}$ is the predicted value of user $a$ on item $j$, $T$ is 
the test set, and $|T|$ is the size of the test set. 
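
As a small worked illustration of the MAE definition, the sketch below assumes the test set and the predictions are both dictionaries keyed by (user, item) pairs; the function name and the sample values are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (user, item)


def mean_absolute_error(test_set: Dict[Pair, float],
                        predicted: Dict[Pair, float]) -> float:
    """Average absolute difference between actual and predicted votes over T."""
    if not test_set:
        raise ValueError("the test set must not be empty")
    total = sum(abs(vote - predicted[pair]) for pair, vote in test_set.items())
    return total / len(test_set)


actual = {("a", "m1"): 4.0, ("a", "m2"): 2.0}
guesses = {("a", "m1"): 3.5, ("a", "m2"): 3.0}
mae = mean_absolute_error(actual, guesses)  # (0.5 + 1.0) / 2
```

A lower MAE means the predictions lie closer to the votes the users actually gave.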

We select 2000 users and, in turn, choose one user as the active user and the remaining users as his candidate 
neighbors, because every user computes only his own recommendations locally. We use the mean prediction 
accuracy over all 2000 users as the system's prediction accuracy. For every user's recommendation calculation, 
our tests are performed using 80% of the user's ratings for training, with the remainder for testing. 
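
The per-user split can be sketched as follows. The text does not say how the 80% is chosen, so the random shuffle here is an assumption, as are the function name split_votes and the sample data.

```python
import random
from typing import Dict, Optional, Tuple


def split_votes(ratings: Dict[str, float], train_fraction: float = 0.8,
                rng: Optional[random.Random] = None
                ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split one user's ratings into a training part and a test part."""
    rng = rng or random.Random()
    items = sorted(ratings)          # fixed starting order for reproducibility
    rng.shuffle(items)               # assumed: the split is random
    cut = int(len(items) * train_fraction)
    train = {j: ratings[j] for j in items[:cut]}
    test = {j: ratings[j] for j in items[cut:]}
    return train, test


user_ratings = {f"m{n}": float(n % 6) for n in range(10)}
train, test = split_votes(user_ratings, rng=random.Random(0))
```

With ten ratings this yields eight for training and two for testing; the held-out votes are the ones compared against the predictions when computing MAE.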

5.3. Experimental result 

We design several experiments for evaluating our algorithm and analyze the effect of various factors (e.g. SR 
and UA, etc.) by comparison. All our experiments are run on a Windows 2000-based PC with Intel Pentium 4 
processor having a speed of 1.8 GHz and 512 MB of RAM. 

5.3.1. The efficiency of neighbor choosing 

We used a data set of 5000 users. Fig. 6 shows how many of the users chosen by the PipeCF algorithm are among 
the top-100 users, i.e. the 100 users most similar to the active user under traditional CF. When the number of 
users rises above 1000, more than 80 of these top-100 users are chosen by the PipeCF algorithm. 

Fig. 6. How many users of PipeCF in traditional CF's top 100. 

5.3.2. Performance comparison 

We compare the prediction accuracy of the traditional CF algorithm and the PipeCF algorithm, applying both 
top-all and top-100 user selection to each. The results are shown in Fig. 7. We can see that the DHT-based 
algorithm has better prediction accuracy than the traditional CF algorithm. 

Fig. 7. PipeCF vs. traditional CF. 



5.3.3. The effect of significance refinement 

We limit the number of returned users for each bucket to 2 and to 5, and repeat the experiment in Section 5.3.2. 
The users for each bucket are chosen randomly. The number of users chosen and the prediction accuracy are 
shown in Fig. 8 and Fig. 9, respectively. The results show that: 

(1) 'Return All' has an O(N) returned user number, and its prediction accuracy is also not satisfactory; 

(2) 'Return 2' has the smallest returned user number but the worst prediction accuracy; 

(3) 'Return 5' has the best prediction accuracy, and its scalability is still reasonably good (the returned user 
number remains limited to a constant as the total user number increases). 

Fig. 8. The effect on scalability of SR on PipeCF. 

Fig. 9. The effect on prediction accuracy of SR on PipeCF algorithm. 

5.3.4. The effect of unanimous amplification 

We adjust the weights for each user using Eq. (5), setting α to 2.0, β to 4.0 and γ to 4, and repeat the 
experiment in Section 5.3.2. We use the top-100 and 'Return All' selection methods. The result shows that 
the UA approach improves the prediction accuracy of both the traditional and the PipeCF algorithm. From Fig. 10 
we can see that when the UA approach is applied, the two kinds of algorithms have almost the same performance. 

Fig. 10. The effect on prediction accuracy of unanimous amplification. 

6. Conclusion and future work 

In this article, we first propose a novel distributed hash table (DHT) based technique to implement efficient user 
database management and retrieval in a decentralized CF system. We then propose a heuristic algorithm to fetch 
similar users from the DHT overlay network and compute recommendations locally. Finally, we propose two novel 
approaches, SR and UA, to improve the performance of our DHT-based CF algorithm. The experimental data 
show that our DHT-based CF system has better prediction accuracy, efficiency and scalability than traditional CF 
systems. 

Our future work includes investigating more efficient decentralized user database management and K-Nearest 
Neighbor (KNN) methods that can dynamically self-organize users with semantically similar interests by 
combining content-based filtering techniques. We would also like to investigate the influence of parameter 
choice in UA. 

Abstract 

The Collaborative Filtering (CF) technique has proved to be one of the most successful techniques in 
recommender systems in recent years. However, most existing CF-based recommender systems work in a 
centralized way and suffer from poor scalability, because their calculation complexity increases quickly in both 
time and space as the number of records in the user database grows. In this article, we first propose a 
distributed CF algorithm called PipeCF, together with two novel approaches, significance refinement and 
unanimous amplification, to further improve scalability and prediction accuracy. We then show how to 
implement this algorithm on a Peer-to-Peer (P2P) structure through the distributed hash table method, which is 
the most popular and efficient P2P routing algorithm, to construct a scalable distributed recommender system. 
The experimental data show that the distributed CF-based recommender system has much better scalability than 
traditional centralized ones, with comparable prediction efficiency and accuracy. 

Author Keywords: Recommender system; Collaborative filtering; Peer-to-Peer; Significance refinement; 
Unanimous amplification 

1. Introduction 

A recommender system helps users find the items they want by making recommendations based either on the 
content of the recommended items (Content-based Filtering) or on the ratings of similar users on the 
recommended items (Collaborative Filtering, CF). Since Goldberg, Nichols, Oki, and Terry (1992) published 
the first account of using CF for information filtering, CF has proved to be one of the most successful techniques 
in recommender systems, thanks to the advantage that no explicit description of items is needed. The key idea of 
CF is that users will prefer those items that people with similar interests prefer, or even that dissimilar people do 
not prefer. Most CF algorithms can be separated into three steps, as addressed by Herlocker, Konstan, 
Borchers, and Riedl (1999): (1) Similarity Weight: weight all users with respect to their similarity with the active 
user, i.e. the user whose preferences are to be predicted; (2) Selecting Neighborhoods: select the users used to 
make the prediction; (3) Rating Normalization and Prediction Making: normalize and calculate the weighted 
sum of the selected users' ratings, then make the prediction based on that. According to the techniques used in 
the first step, CF algorithms can be divided into two classes: memory-based algorithms and model-based 
algorithms. Breese, Heckerman, and Kadie (1998) performed an empirical analysis of both kinds of CF 
algorithms, while Herlocker et al. (1999) presented an algorithmic framework for performing CF. 

GroupLens (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994) was the first CF algorithm to automate 
prediction, and it used a memory-based algorithm. Like most memory-based algorithms, GroupLens needs to 
compute across the whole user database to calculate the similarities between the active user and other users 
before making a prediction. Ringo (Shardanand & Maes, 1995) only used those neighbors whose correlations were 
greater than a given threshold to make predictions. This approach not only reduced the calculation complexity 
but also proved to improve the performance. The same improvement can also be obtained by choosing the top-N 
users with the highest correlations. However, the similarities of all the other users still have to be calculated, 
and the complexity increases quickly in both time and space as the number of records in the database grows. 

Basically, there are two ways to reduce this calculation complexity. The first is to use a model-based 
algorithm, which first constructs certain mathematical models, such as Bayesian networks or Bayesian 
classifiers, to describe the users and/or their ratings, then learns these models from the database and uses 
them to make predictions. However, these approaches also need complex calculation when compiling the models, 
and they also require a central database holding all the user data, which is sometimes hard to achieve not only 
for technical reasons but also for privacy reasons. 

The second way is to implement CF in a decentralized way. In fact, as Peer-to-Peer (P2P) gains more and more 
popularity, some researchers have already begun to consider it as an alternative architecture for reducing the 
calculation complexity of centralized CF algorithms (Tveit, 2001; Olsson, 2003; Canny, 2002). The main 
difference between a centralized CF-based recommender system and a distributed one is that the originally 
centralized user database is maintained in a decentralized way: each peer keeps only a fraction of the user 
database, and when making a prediction for a particular user, the needed records must first be retrieved from 
other peers into the user's own database and then processed locally. In order to do this, the following two 
problems have to be addressed: (1) how to store the user database in a distributed way so that the needed 
information can be found efficiently; (2) how to identify the records needed to make a prediction for a particular 
user and fetch them efficiently, since retrieving all other users' votes is not only unreasonable but also 
unnecessary. 

The main contributions of this article are: 

(1) We propose PipeCF, a distributed CF algorithm that can be implemented on a P2P overlay network; 

(2) We propose two novel approaches, significance refinement (SR) and unanimous amplification (UA), to 
improve the performance of our distributed CF algorithm; 

(3) We give a framework for implementing our distributed CF algorithm on a P2P overlay network through a 
distributed hash table (DHT) based technique, which provides efficient user database management and retrieval 
for constructing a decentralized CF recommender system. 

The rest of this article is organized as follows. In Section 2, several related works are presented and discussed. 
In Section 3, we introduce our distributed CF algorithm, PipeCF, together with two techniques, SR and UA, that 
improve its scalability and prediction accuracy. In Section 4, we describe the architecture and key features of our 
DHT-based CF system. In Section 5, the experimental results of our system are presented and analyzed. Finally, 
we make brief concluding remarks and outline future work in Section 6. 

2. Related work 

2.1. Memory-based CF algorithm 

Generally, the task of CF is to predict the votes of the active user from the user database, which consists of a 
set of votes $v_{i,j}$, each corresponding to the vote of user $i$ on item $j$. A memory-based CF algorithm 
calculates this prediction as a weighted average of other users' votes on that item through the following formula: 

$$P_{a,j} = \bar{v}_a + k \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i) \qquad (1)$$

where $P_{a,j}$ denotes the prediction of the vote for active user $a$ on item $j$, $n$ is the number of users in 
the user database, and $\bar{v}_i$ is the mean vote for user $i$: 

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j} \qquad (2)$$

where $I_i$ is the set of items on which user $i$ has voted. The weights $w(a,i)$ reflect the similarity between 
the active user and the other users in the user database, and $k$ is a normalizing factor that makes the absolute 
values of the weights sum to unity. 
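
The prediction rule of Eq. (1) and the mean vote of Eq. (2) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the weights are supplied by the caller, neighbors who did not vote on the target item are skipped, and the prediction falls back to the active user's mean when no neighbor is usable. The names mean_vote and predict and the sample data are hypothetical.

```python
from typing import Dict

Votes = Dict[str, Dict[str, float]]  # user -> {item: vote}


def mean_vote(votes: Votes, user: str) -> float:
    """Eq. (2): average of a user's votes over the items that user rated."""
    ratings = votes[user]
    return sum(ratings.values()) / len(ratings)


def predict(votes: Votes, weights: Dict[str, float],
            active: str, item: str) -> float:
    """Eq. (1): active user's mean plus a normalized weighted sum of the
    neighbors' deviations from their own mean votes."""
    # Assumption: only neighbors who actually voted on the item contribute.
    neighbors = [u for u in weights if u != active and item in votes[u]]
    norm = sum(abs(weights[u]) for u in neighbors)
    base = mean_vote(votes, active)
    if norm == 0:
        return base  # assumed fallback when no neighbor is usable
    k = 1.0 / norm   # makes the absolute values of the weights sum to unity
    return base + k * sum(
        weights[u] * (votes[u][item] - mean_vote(votes, u)) for u in neighbors
    )


sample_votes = {
    "a": {"m1": 4.0, "m2": 2.0},
    "b": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "c": {"m1": 1.0, "m3": 2.0},
}
sample_weights = {"b": 1.0, "c": 0.5}  # hypothetical values of w(a, i)
prediction = predict(sample_votes, sample_weights, "a", "m3")
```

Here user b votes exactly at its own mean on m3 and contributes nothing, while user c votes 0.5 above its mean, so the prediction lands slightly above the active user's mean of 3.0.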

Most memory-based algorithms use Eq. (1) to make predictions and differ only in the way they calculate the 
weights: 

2.1.1. Pearson correlation coefficient 

The Pearson correlation coefficient was first introduced into collaborative filtering as a weighting method in the 
GroupLens project. The correlation between users $a$ and $i$ is: 

$$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}} \qquad (3)$$

where the summations are calculated over those items for which both users $a$ and $i$ have voted. 
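
A minimal sketch of the Pearson weight of Eq. (3) follows. It assumes that each mean is taken over the user's full vote set, as in Eq. (2), and that the weight is 0 when the users share no item or the denominator vanishes; the function name and the sample votes are hypothetical.

```python
from math import sqrt
from typing import Dict


def pearson_weight(va: Dict[str, float], vi: Dict[str, float]) -> float:
    """Pearson correlation between two users over their co-rated items."""
    common = va.keys() & vi.keys()
    if not common:
        return 0.0  # assumed: no shared item means no evidence of similarity
    mean_a = sum(va.values()) / len(va)
    mean_i = sum(vi.values()) / len(vi)
    num = sum((va[j] - mean_a) * (vi[j] - mean_i) for j in common)
    den = sqrt(sum((va[j] - mean_a) ** 2 for j in common)
               * sum((vi[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0


same_taste = pearson_weight({"m1": 1, "m2": 2, "m3": 3},
                            {"m1": 2, "m2": 4, "m3": 6})
opposite_taste = pearson_weight({"m1": 1, "m2": 2, "m3": 3},
                                {"m1": 6, "m2": 4, "m3": 2})
```

Two users whose votes rise and fall together get a weight of 1, and users with exactly opposed votes get a weight of -1, which is why dissimilar users can still carry information in Eq. (1).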

2.1.2. Vector similarity 

Vector similarity was first used to measure the similarity between two documents. Each document was 
viewed as a vector of word frequencies, and the similarity of two documents was computed as the cosine of the 
angle between their vectors. In collaborative filtering, we treat each user record as a document and the votes as 
the frequencies of items. The weights can then be calculated as: 

$$w(a,i) = \sum_j \frac{v_{a,j}}{\sqrt{\sum_{l \in I_a} v_{a,l}^2}} \cdot \frac{v_{i,j}}{\sqrt{\sum_{l \in I_i} v_{i,l}^2}} \qquad (4)$$

2.2. P2P system and DHT routing algorithm 

The term 'Peer-to-Peer' refers to a class of systems and applications that employ distributed resources to 
perform a critical function in a decentralized manner. With the pervasive deployment of computers, P2P is 
receiving increasing attention in research, and more and more P2P systems have been deployed on the 
Internet. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on 
centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; 
and enabling resource aggregation. Three main classes of peer-to-peer applications have emerged: 
parallelizable, content and file management, and collaborative. 

As the main purpose of P2P systems is to share resources among a group of computers, called peers, in a 
distributed way, efficient and robust routing algorithms for locating a wanted resource are critical to the 
performance of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is one of the 
most popular and effective, and it is supported by many P2P systems such as CAN (Ratnasamy, Francis, Handley, 
Karp, & Shenker, 2001), Chord (Stoica et al., 2001), Pastry (Rowstron & Druschel, 2001), and Tapestry (Zhao et 
al., 2001). 

A DHT overlay network is composed of several DHT nodes, and each node keeps a set of resources (e.g. files, 
ratings of items). Each resource is associated with a key (produced, for instance, by hashing the file name), and 
each node in the system is responsible for storing a certain range of keys. Peers in the DHT overlay network 
locate a wanted resource by issuing a lookup(key) request, which returns the identity (e.g. the IP address) of the 
node that stores the resource with that key. The primary goals of DHT are to provide an efficient, scalable, 
and robust routing algorithm that reduces the number of P2P hops involved in locating a certain resource, and to 
reduce the amount of routing state that must be preserved at each peer. In Chord (Stoica et al., 2001), each 
peer keeps track of information about log N other peers (N is the total number of peers in the community). When 
a peer joins or leaves the overlay network, this highly optimized version of the DHT algorithm only requires 
notifying log N peers about the change. 
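
The mapping from keys to responsible nodes can be illustrated with a toy identifier ring. The sketch below is not Chord's routing protocol: it assumes a global view of all nodes and resolves a key with a binary search, whereas a real DHT reaches the same node in a small number of hops using only local routing state. The class name ToyRing, the 32-bit identifier space and the sample addresses are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left
from typing import List

RING_BITS = 32  # assumed identifier length for this toy ring


def ring_id(name: str) -> int:
    """Hash a node address or resource key onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class ToyRing:
    """Global-view stand-in for a DHT: every key belongs to its successor node."""

    def __init__(self, node_addresses: List[str]) -> None:
        self._nodes = sorted((ring_id(a), a) for a in node_addresses)
        self._ids = [node_id for node_id, _ in self._nodes]

    def lookup(self, key: str) -> str:
        """Return the address of the first node whose identifier is not
        smaller than the key's identifier, wrapping around the ring."""
        pos = bisect_left(self._ids, ring_id(key)) % len(self._ids)
        return self._nodes[pos][1]


addresses = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = ToyRing(addresses)
owner = ring.lookup("<17, 5>")  # key built from an <ITEM_ID, VOTE> tuple
```

Because every peer applies the same hash function, any peer that issues lookup for the same key is directed to the same node.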

3. Our distributed CF algorithm 

3.1. Basic PipeCF algorithm 

The first step in implementing a CF algorithm in a distributed way is to divide the original centralized user 
database into fractions that can be stored on peers in a distributed way. For concision, we use the term bucket 
to denote one fraction of the whole user database in the rest of this article. Each bucket must also be assigned 
an identifier through which it can be located later when needed. We perform the division so that each bucket 
holds the records of a group of users who have rated at least one item with the same vote. That is, we construct 
one bucket for every distinct <ITEM_ID, VOTE> tuple and use that tuple as the identifier of the bucket. Fig. 1 
shows our division strategy: 

Fig. 1. User database division strategy. 
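
The division strategy can be sketched in a few lines. The dictionary layout, the function name build_buckets and the sample votes below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

BucketId = Tuple[str, int]  # the <ITEM_ID, VOTE> identifier


def build_buckets(votes: Dict[str, Dict[str, int]]) -> Dict[BucketId, Set[str]]:
    """Place every user in one bucket per rated item, keyed by (item, vote)."""
    buckets: Dict[BucketId, Set[str]] = defaultdict(set)
    for user, ratings in votes.items():
        for item, vote in ratings.items():
            buckets[(item, vote)].add(user)
    return dict(buckets)


database = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 5},
    "u3": {"m1": 4, "m2": 3},
}
buckets = build_buckets(database)
```

In this example u1 and u2 share the bucket ("m1", 5), u1 and u3 share ("m2", 3), and u3 is alone in ("m1", 4). A user therefore appears in as many buckets as items it has rated.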



In order to reduce the calculation complexity and achieve scalability, we need a strategy that chooses neighbors 
from the most suitable buckets when making a prediction. The PipeCF algorithm chooses neighbors based on the 
heuristic that people with similar interests rate at least one item with similar votes. As we can see in Fig. 6, this 
strategy has a very high hit ratio. So when making a prediction, PipeCF only uses the records of those users who 
share at least one bucket with the active user. In this way we save about 50% of the calculation of the traditional 
CF algorithm and obtain comparable prediction accuracy, as shown in Fig. 7. 

Fig. 2. Architecture of DHT-based CF recommender system. 

3.2. Improved PipeCF algorithm 

3.2.1. Significance refinement 

In the basic PipeCF algorithm, we use all users that are in the same bucket as the active user, and we find that 
the number of fetched users is O(N) (N is the total number of users), as Fig. 8 shows. In fact, as Breese et al. 
(1998) showed under the term inverse user frequency, universally liked items are not as useful as less common 
items in capturing similarity. We therefore introduce a new concept, SR, which reduces the number of users 
returned by the basic PipeCF algorithm by limiting the number of returned users for each bucket. We call the 
algorithm improved by SR Return K, which means 'for every item, the PipeCF algorithm returns no more than 
K users for each bucket'. The experimental result in Section 5.3.3 shows that this method reduces the returned 
user number dramatically and also improves the prediction accuracy. 
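
The basic neighbor selection of Section 3.1 and the Return K refinement can be sketched together. This is a minimal illustration assuming buckets are held in a local dictionary keyed by (item, vote) and that the K users of an oversized bucket are drawn at random, as in the experiments; the function name candidate_neighbors and the sample data are hypothetical.

```python
import random
from typing import Dict, Optional, Set, Tuple

BucketId = Tuple[str, int]  # the <ITEM_ID, VOTE> identifier


def candidate_neighbors(buckets: Dict[BucketId, Set[str]],
                        active_votes: Dict[str, int],
                        active: str,
                        k: Optional[int] = None,
                        rng: Optional[random.Random] = None) -> Set[str]:
    """Users sharing at least one bucket with the active user.

    With k=None every bucket member is returned ('Return All');
    otherwise at most k randomly chosen members per bucket ('Return K').
    """
    rng = rng or random.Random()
    chosen: Set[str] = set()
    for item, vote in active_votes.items():
        members = sorted(buckets.get((item, vote), set()) - {active})
        if k is not None and len(members) > k:
            members = rng.sample(members, k)
        chosen.update(members)
    return chosen


toy_buckets = {
    ("m1", 5): {"a", "u1", "u2", "u3"},
    ("m2", 3): {"a", "u4"},
}
votes_of_a = {"m1": 5, "m2": 3}
return_all = candidate_neighbors(toy_buckets, votes_of_a, "a")
return_2 = candidate_neighbors(toy_buckets, votes_of_a, "a", k=2,
                               rng=random.Random(0))
```

Capping each bucket bounds the number of fetched users by K times the number of items the active user has rated, which is what keeps the returned user number constant as the total user number grows.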
